Distributed Deep Learning
Resources
- https://d2l.ai/chapter_computational-performance/multiple-gpus.html
- https://jhui.github.io/2017/03/07/TensorFlow-GPU/
- https://www.logicalclocks.com/blog/goodbye-horovod-hello-collectiveallreduce
- Twelve ways to fool the masses when reporting performance of deep learning workloads
- Distributed Deep Learning 101: Introduction
Talks
- #TALK ALCF Datascience frameworks: Tensorflow, PyTorch, Keras, and Horovod
- #TALK Scaling Deep Learning for Scientific Workloads on the #1 Summit Supercomputer
- #TALK Scaling Neural Networks Training - Thorsten Kurth
Code
See AI/Data Engineering/Tensorflow#Distributed training
- #CODE Analytics Zoo
    - Distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray
    - https://analytics-zoo.readthedocs.io/en/latest/index.html
- #CODE Horovod (see the data-parallel training sketch after this list)
- #CODE Colossal-AI: A Unified Deep Learning System for Large-Scale Parallel Training
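
A minimal Horovod data-parallel training sketch to go with the #CODE entries above, assuming TensorFlow 2.x and Horovod built with TensorFlow support. The model and dataset are toy placeholders, and optimizer wrapping details can vary across Horovod/Keras versions:

```python
# Minimal Horovod data-parallel sketch (toy model/dataset, illustrative only).
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()  # one process per GPU, started e.g. with horovodrun/mpirun

# Pin each process to a single local GPU
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Toy dataset: each worker trains on its own shard
(x, y), _ = tf.keras.datasets.mnist.load_data()
dataset = (tf.data.Dataset.from_tensor_slices((x[..., None] / 255.0, y))
           .shard(hvd.size(), hvd.rank())
           .shuffle(10000)
           .batch(64))

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged with allreduce at every step
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(1e-3 * hvd.size()))
model.compile(
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=opt)

callbacks = [
    # Broadcast initial weights from rank 0 so all workers start identically
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

model.fit(dataset, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

Launched with one process per GPU, e.g. `horovodrun -np 4 python train.py`.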
References
- #PAPER Evaluation of Deep Learning Frameworks over Different HPC Architectures (Shams 2017)
- #PAPER Deep Learning at 15PF: Supervised and Semi-Supervised Classification for Scientific Data (Kurth 2017)
- #PAPER Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis (Ben-Nun and Hoefler 2018)
- #PAPER Mesh-TensorFlow: Deep Learning for Supercomputers (Shazeer 2018)
    - #TALK https://www.youtube.com/watch?v=HgGyWS40g-g
    - #CODE Mesh-TensorFlow
    - Goes beyond data-parallel training
    - Supports more sophisticated parallel computations, e.g. big models that do not fit on one device (see the model-parallelism sketch after this entry)
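
The Mesh-TensorFlow entry targets model parallelism rather than data parallelism. As a rough illustration of the underlying idea only (this is plain TensorFlow device placement, not the Mesh-TensorFlow API; layer sizes and device names are arbitrary assumptions), the sketch below splits a wide hidden layer column-wise across two devices:

```python
# Hand-rolled tensor/model parallelism sketch in plain TensorFlow:
# the hidden layer's weight matrix is split column-wise across two devices,
# so neither device has to hold the full layer.
import tensorflow as tf

tf.config.set_soft_device_placement(True)  # fall back to CPU if GPUs are missing

d_in, d_hidden, d_out = 1024, 8192, 10     # arbitrary sizes for the sketch
devices = ["/GPU:0", "/GPU:1"]

# Each device owns half of the hidden layer's columns
shards = []
for i, dev in enumerate(devices):
    with tf.device(dev):
        shards.append(tf.Variable(
            tf.random.normal([d_in, d_hidden // len(devices)], stddev=0.02),
            name=f"w1_shard_{i}"))

with tf.device(devices[0]):
    w2 = tf.Variable(tf.random.normal([d_hidden, d_out], stddev=0.02), name="w2")

@tf.function
def forward(x):
    # Each device computes its slice of the hidden activations...
    parts = []
    for dev, w in zip(devices, shards):
        with tf.device(dev):
            parts.append(tf.nn.relu(tf.matmul(x, w)))
    # ...and the slices are concatenated before the (small) output layer
    h = tf.concat(parts, axis=-1)
    return tf.matmul(h, w2)

logits = forward(tf.random.normal([32, d_in]))
print(logits.shape)  # (32, 10)
```

Mesh-TensorFlow automates this kind of splitting: tensor dimensions are named, and a user-specified layout maps named dimensions onto a mesh of processors, from which the per-device computation and the required communication are derived.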
- #PAPER GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism (Huang 2019)
- #PAPER A Quantitative Study of Deep Learning Training on Heterogeneous Supercomputers (Han 2019)
- #PAPER Channel and filter parallelism for large-scale CNN training (Dryden 2019)
- #PAPER Improving Strong-Scaling of CNN Training by Exploiting Finer-Grained Parallelism (Dryden 2019)
- #PAPER Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training (Li 2019)
- #PAPER Scalable Deep Learning on Distributed Infrastructures: Challenges, Techniques and Tools (Mayer 2019)
- #PAPER Performance Analysis of Deep Learning Workloads on Leading-edge Systems (Ren 2019)
- #PAPER TensorFlow on State-of-the-Art HPC Clusters: A Machine Learning use Case (Ramirez-Gargallo 2019)
    - https://core.ac.uk/download/pdf/196280993.pdf
    - Compares the MN4, Power9 and Dibona HPC clusters; only CPUs are evaluated (the Power9 GPUs are not)
- #PAPER Exascale Deep Learning for Scientific Inverse Problems (Laanait 2019)
- #PAPER ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (Rajbhandari 2019)
    - #CODE DeepSpeed
        - DeepSpeed is a deep learning optimization library for PyTorch that makes distributed training easy, efficient, and effective (see the sketch below)
        - https://www.deepspeed.ai/
        - https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/
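
A minimal DeepSpeed training-loop sketch to go with the ZeRO/DeepSpeed entry above. The toy model, batch sizes, ZeRO stage and launch command are illustrative assumptions; exact config keys and the `deepspeed.initialize` signature should be checked against the docs for the installed version:

```python
# Minimal DeepSpeed + ZeRO sketch (toy model; config values are illustrative).
import torch
import deepspeed

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 10))

ds_config = {
    # train_batch_size = micro-batch (8) x grad accumulation (1) x 8 workers
    "train_batch_size": 64,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "fp16": {"enabled": True},
    # ZeRO stage 2: partition optimizer states and gradients across workers
    "zero_optimization": {"stage": 2},
}

# deepspeed.initialize wraps the model/optimizer in a distributed engine
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)

for step in range(10):
    # Random micro-batch as a stand-in for a real data loader
    x = torch.randn(8, 1024, device=model_engine.device, dtype=torch.half)
    y = torch.randint(0, 10, (8,), device=model_engine.device)
    loss = torch.nn.functional.cross_entropy(model_engine(x), y)
    model_engine.backward(loss)   # handles gradient averaging/partitioning
    model_engine.step()           # optimizer step + ZeRO bookkeeping
```

Launched e.g. with `deepspeed --num_gpus=8 train.py`, so that the configured train_batch_size matches 8 workers with a micro-batch of 8.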
- #PAPER Towards a Scalable and Distributed Infrastructure for Deep Learning Applications (Hasheminezhad 2020)
    - Phylanx Deep Learning Framework
    - Good comparison with respect to the SOTA
    - Phylanx provides a high-productivity, debuggable, Python-based interactive interface (JetLag)
    - Tests are CPU-only; does it support GPUs?
- #PAPER Distributed Training of Deep Learning Models: A Taxonomic Perspective (Langer 2020)
- #PAPER Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training (Bian 2021)
- #PAPER Pathways: Asynchronous Distributed Dataflow for ML (Barham 2022)
- #PAPER Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? (Tay 2022)